Hive
  1. Hive
  2. HIVE-5304

Hive results can depend on metastore's underlying datastore

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Metastore
    • Labels:
      None

      Description

      [removed old description]
      Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString' to underlying SQL datastore. However, the datastore may handle these differently based on the encoding and collation used for the columns of the database.
      So, query results can change depending on the underlying store for the metastore and its version.

      drop_partitions_filter.q test illustrates this problem. In byte order collation (proper way) USA is sorted before Uganda, but some collations may do it the other way, causing the test to fail.

      I am assuming that byte-order sort if the correct way to order things.
      Our MySQL script specifies _bin collation, which is byte-order; Postgres 9.1 and after, as far as I see, defaults to "C" collation, which is also byte-order.
      Derby seems to use byte-order by default, I didn't spend a lot of time on Derby.
      However, Postgres before 9.1 seems to default to "en_US.UTF8" and there's no way to change column collation in our script if database is already created.
      MySQL by default doesn't use _bin collation (on my machine), so if database is auto-created, the order of things is going to change.
      I didn't investigate MSSQL or Oracle.

      For now it seems that:
      1) Auto-create shouldn't be used.
      2) If old version of postgres (<9.1) is used, the collation should be set properly by whoever issues "create database" (that is not our script).
      3) We might want to add 'collate "C"' to varchar columns in the postgres script to ensure the correct collation; however, this will break the script for postgres <9.1.
      4) MSSQL and Oracle might warrant investigation.

        Activity

        Sergey Shelukhin created issue -
        Hide
        Sergey Shelukhin added a comment -

        Actually, all names - I see show indexes order also changes in some queries, etc.

        Show
        Sergey Shelukhin added a comment - Actually, all names - I see show indexes order also changes in some queries, etc.
        Sergey Shelukhin made changes -
        Field Original Value New Value
        Summary JDO and SQL filters can both return different results for string compares depending on underlying datastore there are various string encoding problems in metastore depending on underlying datastore
        Sergey Shelukhin made changes -
        Description Hive uses JDOQL filters to optimize partition retrieval; recently direct SQL was added to optimize it further. Both of these methods may end up pushing StringCol op 'SomeString' to underlying SQL datastore. Many paths also pushes order by-s, although these are not as problematic.
        The problem is that different datastores handle string compares differently.
        While testing on Postgres, I see that results in different things, from innocent like order changes in "show partitions", to more serious like
        {code}
         alter table ptestfilter drop partition (c>='US', d<='2')
        {code}
        in drop_partitions_filter.q - in Derby, with which the .q.out file was generated, it drops "c=Uganda/d=2"; this also passes on MySQL (I ran tests with autocreated db); on Postgres with a db from the script it doesn't.
        Looks like we need to enforce collation in partition names and part_key_values-es; both in the create scripts, as well as during autocreate (via package.jdo?)
        Hive uses JDOQL filters to optimize partition retrieval; recently direct SQL was added to optimize it further. Both of these methods may end up pushing StringCol op 'SomeString' to underlying SQL datastore. Many paths also pushes order by-s, although these are not as problematic.
        The problem is that different datastores handle string compares differently.
        While testing on Postgres, I see that results in different things, from innocent like order changes in "show partitions", to more serious like
        {code}
         alter table ptestfilter drop partition (c>='US', d<='2')
        {code}
        in drop_partitions_filter.q - in Derby, with which the .q.out file was generated, it drops "c=Uganda/d=2"; this also passes on MySQL (I ran tests with autocreated db); on Postgres with a db from the script it doesn't.
        Looks like we need to enforce collation in partition names and part_key_values-es; both in the create scripts, as well as during autocreate (via package.jdo?)

        EDIT:
        also affected are - show indexes. So all names need to be taken care of

        Then; describe_comment_nonascii.q fails against MySQL on autocreated db due to problems with commends.
        Sergey Shelukhin made changes -
        Summary there are various string encoding problems in metastore depending on underlying datastore Hive results can depend on metastore's underlying datastore, if autocreate is used
        Sergey Shelukhin made changes -
        Description Hive uses JDOQL filters to optimize partition retrieval; recently direct SQL was added to optimize it further. Both of these methods may end up pushing StringCol op 'SomeString' to underlying SQL datastore. Many paths also pushes order by-s, although these are not as problematic.
        The problem is that different datastores handle string compares differently.
        While testing on Postgres, I see that results in different things, from innocent like order changes in "show partitions", to more serious like
        {code}
         alter table ptestfilter drop partition (c>='US', d<='2')
        {code}
        in drop_partitions_filter.q - in Derby, with which the .q.out file was generated, it drops "c=Uganda/d=2"; this also passes on MySQL (I ran tests with autocreated db); on Postgres with a db from the script it doesn't.
        Looks like we need to enforce collation in partition names and part_key_values-es; both in the create scripts, as well as during autocreate (via package.jdo?)

        EDIT:
        also affected are - show indexes. So all names need to be taken care of

        Then; describe_comment_nonascii.q fails against MySQL on autocreated db due to problems with commends.
        [removed old description]
        Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString' to underlying SQL datastore. However, the datastore may handle these differently based on the encoding and collation used for the columns of the database.
        So, query results can change depending on the underlying store for the metastore, and OS used

        I am assuming that byte-order sort if the correct way to order things.
        Our MySQL script specifies _bin collation, and Postgres, as far as I see, defaults to "C"; both of those are byte-order collations.
        However, MySQL by default doesn't use _bin collation, so if database is auto-created, the order of things is going to change.
        Derby also uses the non-byte-order by

        drop_partitions_filter.q illustrates this problem. In byte order collation (proper way) USA is sorted before Uganda
        Sergey Shelukhin made changes -
        Summary Hive results can depend on metastore's underlying datastore, if autocreate is used Hive results can depend on metastore's underlying datastore
        Sergey Shelukhin made changes -
        Description [removed old description]
        Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString' to underlying SQL datastore. However, the datastore may handle these differently based on the encoding and collation used for the columns of the database.
        So, query results can change depending on the underlying store for the metastore, and OS used

        I am assuming that byte-order sort if the correct way to order things.
        Our MySQL script specifies _bin collation, and Postgres, as far as I see, defaults to "C"; both of those are byte-order collations.
        However, MySQL by default doesn't use _bin collation, so if database is auto-created, the order of things is going to change.
        Derby also uses the non-byte-order by

        drop_partitions_filter.q illustrates this problem. In byte order collation (proper way) USA is sorted before Uganda
        [removed old description]
        Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString' to underlying SQL datastore. However, the datastore may handle these differently based on the encoding and collation used for the columns of the database.
        So, query results can change depending on the underlying store for the metastore and its version.

        drop_partitions_filter.q test illustrates this problem. In byte order collation (proper way) USA is sorted before Uganda, but some collations may do it the other way, causing the test to fail.

        I am assuming that byte-order sort if the correct way to order things.
        Our MySQL script specifies _bin collation, which is byte-order; Postgres 9.1 and after, as far as I see, defaults to "C" collation, which is also byte-order.
        Derby seems to use byte-order by default, I didn't spend a lot of time on Derby.
        However, Postgres before 9.1 seems to default to "en_US.UTF8" and there's no way to change column collation in our script if database is already created.
        MySQL by default doesn't use _bin collation (on my machine), so if database is auto-created, the order of things is going to change.
        I didn't investigate MSSQL or Oracle.

        For now it seems that:
        1) Auto-create shouldn't be used.
        2) If old version of postgres (<9.1) is used, the collation should be set properly by whoever issues "create database" (that is not our script).
        3) We might want to add 'collate "C"' to varchar columns in the postgres script to ensure the correct collation; however, this will break the script for postgres <9.1.
        4) MSSQL and Oracle might warrant investigation.
        Hide
        Sergey Shelukhin added a comment -

        Ashutosh Chauhan I updated the JIRA description with detailed investigation results... just fyi

        Show
        Sergey Shelukhin added a comment - Ashutosh Chauhan I updated the JIRA description with detailed investigation results... just fyi
        Hide
        Sergey Shelukhin added a comment -

        Ashutosh Chauhan do you think we should document this for people who create databases?

        Show
        Sergey Shelukhin added a comment - Ashutosh Chauhan do you think we should document this for people who create databases?

          People

          • Assignee:
            Unassigned
            Reporter:
            Sergey Shelukhin
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development