HIVE-3678: Add metastore upgrade scripts for column stats schema changes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Metastore
    • Labels:
      None

      Description

      Add upgrade scripts for the column statistics schema changes for Postgres/MySQL/Oracle/Derby.
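
      For context, these scripts create the metastore's column statistics tables for each supported database. Below is a minimal, illustrative sketch of the kind of DDL involved (MySQL flavor); the table and column names follow the Hive metastore's naming conventions but are not quoted from the committed scripts and may not match them exactly.

        -- Illustrative sketch only (MySQL flavor); not the exact contents of the
        -- committed 012-HIVE-1362 / upgrade-0.9.0-to-0.10.0 scripts.
        CREATE TABLE IF NOT EXISTS TAB_COL_STATS (
          CS_ID BIGINT NOT NULL,
          TBL_ID BIGINT NOT NULL,
          COLUMN_NAME VARCHAR(128) NOT NULL,
          COLUMN_TYPE VARCHAR(128) NOT NULL,
          LONG_LOW_VALUE BIGINT,
          LONG_HIGH_VALUE BIGINT,
          DOUBLE_LOW_VALUE DOUBLE,
          DOUBLE_HIGH_VALUE DOUBLE,
          NUM_NULLS BIGINT,
          NUM_DISTINCTS BIGINT,
          AVG_COL_LEN DOUBLE,
          MAX_COL_LEN BIGINT,
          NUM_TRUES BIGINT,
          NUM_FALSES BIGINT,
          PRIMARY KEY (CS_ID),
          CONSTRAINT TAB_COL_STATS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID)
        ) ENGINE=InnoDB DEFAULT CHARSET=latin1;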

      Attachments

      1. HIVE-3678.1.patch.txt
        36 kB
        Shreepadma Venugopalan
      2. HIVE-3678.2.patch.txt
        36 kB
        Shreepadma Venugopalan
      3. HIVE-3678.3.patch.txt
        38 kB
        Shreepadma Venugopalan
      4. HIVE-3678.4.patch.txt
        38 kB
        Shreepadma Venugopalan


          Activity

          Ashutosh Chauhan added a comment -

          This issue was fixed and released as part of the 0.10.0 release. If you find an issue that seems to be related to this one, please create a new JIRA and link it with this one.

          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/)
          HIVE-3678 : Add metastore upgrade scripts for column stats schema changes (Shreepadma via Ashutosh Chauhan) (Revision 1414509)

          Result = ABORTED
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414509
          Files :

          • /hive/trunk/metastore/scripts/upgrade/derby/012-HIVE-1362.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/hive-schema-0.10.0.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/upgrade-0.9.0-to-0.10.0.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/012-HIVE-1362.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/hive-schema-0.10.0.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/upgrade-0.9.0-to-0.10.0.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/oracle/012-HIVE-1362.oracle.sql
          • /hive/trunk/metastore/scripts/upgrade/oracle/hive-schema-0.10.0.oracle.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/012-HIVE-1362.postgres.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/hive-schema-0.10.0.postgres.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/upgrade-0.9.0-to-0.10.0.postgres.sql
          • /hive/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
          • /hive/trunk/metastore/src/model/package.jdo
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1822 (See https://builds.apache.org/job/Hive-trunk-h0.21/1822/)
          HIVE-3678 : Add metastore upgrade scripts for column stats schema changes (Shreepadma via Ashutosh Chauhan) (Revision 1414509)

          Result = ABORTED
          hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1414509
          Files :

          • /hive/trunk/metastore/scripts/upgrade/derby/012-HIVE-1362.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/hive-schema-0.10.0.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/upgrade-0.9.0-to-0.10.0.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/012-HIVE-1362.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/hive-schema-0.10.0.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/upgrade-0.9.0-to-0.10.0.mysql.sql
          • /hive/trunk/metastore/scripts/upgrade/oracle/012-HIVE-1362.oracle.sql
          • /hive/trunk/metastore/scripts/upgrade/oracle/hive-schema-0.10.0.oracle.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/012-HIVE-1362.postgres.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/hive-schema-0.10.0.postgres.sql
          • /hive/trunk/metastore/scripts/upgrade/postgres/upgrade-0.9.0-to-0.10.0.postgres.sql
          • /hive/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
          • /hive/trunk/metastore/src/model/package.jdo
          Ashutosh Chauhan added a comment -

          Committed to trunk and 0.10. Thanks, Shreepadma!

          Ashutosh Chauhan added a comment -

          Thanks, Shreepadma, for updating the patch. Running tests now.

          Shreepadma Venugopalan added a comment -

          Uploaded patch rebased off tip of trunk. Thanks.

          Ashutosh Chauhan added a comment -

          +1. The patch looks good, though it fails to apply cleanly. Can you update the patch against the latest trunk?

          Shreepadma Venugopalan added a comment -

          Updated patch is available on both JIRA and RB. Thanks.

          Shreepadma Venugopalan added a comment -

          @Ashutosh: I've uploaded a new patch which adds 2 varchar columns for storing BigDecimal low and high values. Thanks.
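
          A sketch of what the corresponding upgrade-script change might look like (MySQL flavor). The table names TAB_COL_STATS/PART_COL_STATS and the BIG_DECIMAL_* column names are assumptions for illustration; the exact DDL in HIVE-3678.4.patch.txt may differ.

            -- Illustrative only; not quoted from the attached patch.
            ALTER TABLE TAB_COL_STATS
              ADD COLUMN BIG_DECIMAL_LOW_VALUE VARCHAR(4000),
              ADD COLUMN BIG_DECIMAL_HIGH_VALUE VARCHAR(4000);
            ALTER TABLE PART_COL_STATS
              ADD COLUMN BIG_DECIMAL_LOW_VALUE VARCHAR(4000),
              ADD COLUMN BIG_DECIMAL_HIGH_VALUE VARCHAR(4000);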

          Ashutosh Chauhan added a comment -

          Let's store numeric types as long/double. Can you add a varchar column for BigDecimal in the patch? Let's finish the work while we are at it and get it in.

          Shreepadma Venugopalan added a comment -

          @Ashutosh: If we store long/double types as a varchar instead of storing them as numeric types, we can avoid evolving the schema when we add a BigDecimal type. That's the only benefit I see to storing long/double as a varchar. However, I agree with you that we should avoid untyping data when possible. Let me know your thoughts.

          Ashutosh Chauhan added a comment -

          Let's stick with long and double for numeric types (as it is at the moment); we shouldn't untype the data unless there is a strong reason to. For BigDecimal, let's go with a varchar column.

          Shreepadma Venugopalan added a comment -

          @Ashutosh: Thanks for your comments. Do you think it makes sense to store numeric long/double/bigdecimal values in a varchar column? I don't see consistent BLOB/CLOB support across DB vendors and versions. If you agree, I'll make the change to store these numeric values in a varchar column and post a new patch. Thanks.

          Ashutosh Chauhan added a comment -

          @Shreepadma, it seems like you are suggesting we store the BigDecimal statistics as plain strings in CLOBs. I think that's a fair choice. As I think more about the choice between one JSON blob for all stats vs. a different column for each stat, I am leaning towards different columns. Apart from the performance argument around (de)serialization, the extra code we would have to write for (de)serialization presents more opportunities to introduce bugs; it's always good to avoid writing code unless it's useful in substantial ways.
          In a nutshell, I am +1 on the patch as it is. Unless Carl has other suggestions, let's go with it.

          Shreepadma Venugopalan added a comment -

          @Ashutosh: My answers are inline.

          We can add two more columns to the M*ColumnStatistics table of type BigDecimal: BigDecimalLowValue and BigDecimalHighValue. But is the BigDecimal type supported consistently across different DBs?
          Agreed, BigDecimal is not consistently supported across DBs. Hence we can't easily add a BigDecimal column consistently across DB vendors and versions.

          We can have these two columns of type Double, but then we lose precision.
          Yes, we can store BigDecimal and Long as Double, but we will lose precision.

          We can store the values as plain strings in a column of type varchar.
          The maximum number of digits after the decimal point in a BigDecimal number is unlimited for all practical purposes. If we stored it in a varchar, it could result in truncation of some digits following the decimal point in some cases, but this seems to be the only practical solution.

          We can store the values in JSON format in a column of type varchar.
          The maximum number of digits after the decimal point in a BigDecimal number is unlimited for all practical purposes (Java allows nearly 2 billion digits after the decimal point). At this time, we collect MIN and MAX column values for numeric columns. If we stored a BigDecimal value, we might exceed the varchar size limit and as a result truncate the JSON "blob". This would result in a malformed JSON object. Additionally, we would also lose some of the column statistics.

          Ashutosh Chauhan added a comment -

          Shreepadma,
          Any thoughts on how we are going to store low and high values for a BigDecimal column? It seems like there are four possibilities:

          • We can add two more columns to the M*ColumnStatistics table of type BigDecimal: BigDecimalLowValue and BigDecimalHighValue. But is the BigDecimal type supported consistently across different DBs?
          • We can have these two columns of type Double, but then we lose precision.
          • We can store the values as plain strings in a column of type varchar.
          • We can store the values in JSON format in a column of type varchar.

          What do you think will be the best way?

          I imagine there are other numeric types too for which neither long nor double will be a good choice. This question will come up again if we choose to add support for these later.

          Shreepadma Venugopalan added a comment -

          Most users use a multi-byte, variable-width character set for varchar data, so the number of characters available to store the JSON "blob" is less than 4000; assuming an encoding of 3-4 bytes per character, it is roughly 1000 characters. While 1000 characters may be sufficient today, it will not be once we add more fields to the JSON "blob" as a result of adding new types of statistics. When we exceed the 1000-character limit on the JSON blob, it is not clear how we would store the new types of statistics that we gather. While storing the statistics in a JSON "blob" may seem to help avoid schema upgrades in the short term (this is not entirely clear to me because I don't know what the maximum size of the initial JSON blob could be in the worst case, but I'm assuming it for the sake of the argument), it would complicate upgrades once we hit the 1000-character limit. At that point, we would have no easy, simple way to evolve the schema.
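
          To make the byte-vs-character arithmetic concrete: Oracle caps VARCHAR2 at 4000 bytes, so 4000 / 4 bytes per character leaves roughly 1000 characters in the worst case. A hedged Oracle-flavored sketch of the two length semantics follows; the table and column names are hypothetical.

            -- Illustrative only: Oracle length semantics for a hypothetical JSON stats column.
            -- The 4000-byte cap applies either way; with a 4-byte-per-character encoding,
            -- a CHAR-semantics declaration can only safely promise ~1000 characters.
            CREATE TABLE COL_STATS_JSON_DEMO (
              STATS_JSON_BYTES VARCHAR2(4000 BYTE),  -- 4000 bytes, possibly far fewer characters
              STATS_JSON_CHARS VARCHAR2(1000 CHAR)   -- 1000 characters, up to 4000 bytes
            );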

          Carl Steinbach added a comment -

          The problem with storing column statistics as a JSON blob in a varchar is some databases such as Oracle limit the size of varchar to 4000 bytes. While this may seem sufficient, JSON is verbose and the size of the JSON "blob" can exceed 4000 bytes, especially if we choose to evolve the statistics later.

          Given the stats fields that exist right now, what's the longest JSON blob that we expect to create? Does it come anywhere close to 4000 bytes in length?

          There is also additional serialization and deserialization cost while writing and reading statistics.

          This would be an issue if we were worried about filtering stats records based on stats values, but I don't think that's a practical use case.

          Given that the column statistics schema consists of simple data types and is portable across DB vendors and versions...

          The goal with this change is to make the schema even more portable by modifying it to depend on fewer datatypes.

          Shreepadma Venugopalan added a comment -

          The problem with storing column statistics as a JSON blob in a varchar is some databases such as Oracle limit the size of varchar to 4000 bytes. While this may seem sufficient, JSON is verbose and the size of the JSON "blob" can exceed 4000 bytes, especially if we choose to evolve the statistics later. There is also additional serialization and deserialization cost while writing and reading statistics. Given that the column statistics schema consists of simple data types and is portable across DB vendors and versions, I'd prefer to keep things the way they are.

          Ashutosh Chauhan added a comment -

          I agree with Carl; making it easier to evolve the schema such that it's independent of the exact type will be a win. We already have one such use case with BigDecimal support being added over in HIVE-2693.

          Also, the following looks like an unintentional change:

           -- Constraints for table PARTITION_KEYS
          -ALTER TABLE PARTITION_KEYS ADD CONSTRAINT PARTITION_KEYS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;
          +ALTER TABLE PARTITION_KEYS ADD CONSTRAINT PARTITION_KEYS_FK1 FOREIGN KEY (TBTB_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;
          
          Carl Steinbach added a comment -

          Sorry for the confusion. When I wrote "blob" I was trying to convey only that the field will be opaque to the DB (since it's a JSON struct), not that it will actually be stored in a BLOB column. If we store the JSON struct in a VARCHAR we have at least 4000 bytes to work with.

          Shreepadma Venugopalan added a comment -

          With the changes from HIVE-3712, the column schema has no dependency on any specific DB; it uses simple data types, which are supported across DBs. The primary motivation for the schema change in HIVE-3712 was to avoid storing column statistics fields as a BLOB. The problems with using a BLOB are: a) BLOBs are designed to store large volumes of data, on the order of GBs, and are hence stored outside the row. A consequence of this design is that BLOBs don't perform well for storing small amounts of data. While some DBs such as Oracle inline small BLOBs, not all DBs do. While BLOBs are the only practical choice for storing data whose size is not known in advance, they are overkill for storing around 100 bytes of data; and b) there is no uniform support across DB vendors and versions. Hence I don't really see the value in storing this as a JSON BLOB.

          Carl Steinbach added a comment -

          The upgrade scripts look good to me. As for HIVE-3712, which is included in this patch, I have started to wonder if it would be better for the metastore DB to store the column stats values (e.g. min/max value, num trues/falses, min/max/avg length, etc.) as a JSON text blob. This approach would make the code more portable by eliminating dependencies on specific DBs and would also make it easier to add new fields in the future. The big downside of this approach is that we won't be able to push down column stats filters on these fields, but I'm not convinced that this is a practical use case in the first place.
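
          A hedged sketch of the alternative being floated here (not what was committed): collapse the per-statistic columns into a single varchar holding a JSON document. The table and column names are hypothetical.

            -- Hypothetical alternative, for comparison only; the committed patch keeps
            -- one typed column per statistic instead.
            ALTER TABLE TAB_COL_STATS ADD COLUMN COL_STATS_JSON VARCHAR(4000);

            -- Example of the kind of value such a column would hold:
            -- {"lowValue":3,"highValue":941,"numNulls":17,"numDVs":212}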

          Shreepadma Venugopalan added a comment -

          Review board link: https://reviews.apache.org/r/8119/

          Gang Tim Liu added a comment -

          Sure, no problem.

          Shreepadma Venugopalan added a comment -

          @Tim: I'm currently working on providing the upgrade scripts for different databases. Since there is a plan to release Hive 0.10 soon, we have to provide upgrade scripts for all of them. Thanks.

          Gang Tim Liu added a comment -

          Shreepadma Venugopalan, thank you for working on it.

          Do you have a plan for when you will submit a patch? I know it takes effort to code and test on 4 different types of databases.

          Would you please release the MySQL version first, and then file a JIRA to follow up on Postgres/Oracle/Derby? Just a suggestion for a fast release, but it's up to you to decide.

          Thanks a lot again, Tim


            People

            • Assignee: Shreepadma Venugopalan
            • Reporter: Shreepadma Venugopalan
            • Votes: 0
            • Watchers: 6
