Hive: HIVE-2246

Dedupe tables' column schemas from partitions in the metastore db

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Metastore
    • Labels:
      None
    • Release Note:
      This makes an incompatible change in the metastore DB table schema from previous versions (<0.8). Older metastores created with previous versions of Hive will need to be upgraded with the supplied scripts.
    • Tags:
      metastore, schema, JDO

      Description

      Note: this patch proposes a schema change, and is therefore incompatible with the current metastore.

      We can re-organize the JDO models to reduce space usage to keep the metastore scalable for the future. Currently, partitions are the fastest growing objects in the metastore, and the metastore keeps a separate copy of the columns list for each partition. We can normalize the metastore db by decoupling Columns from Storage Descriptors and not storing duplicate lists of the columns for each partition.

      An idea is to create an additional level of indirection with a "Column Descriptor" that has a list of columns. A table has a reference to its latest Column Descriptor (note: a table may have more than one Column Descriptor in the case of schema evolution). Partitions and Indexes can reference the same Column Descriptors as their parent table.

      Currently, the COLUMNS table in the metastore has roughly (number of partitions + number of tables) * (average number of columns per table) rows. We can reduce this to (number of tables) * (average number of columns per table) rows, while incurring a small cost proportional to the number of tables to store the Column Descriptors.
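To make the estimate above concrete, here is a small worked example. The deployment sizes (1,000 tables, 2,000,000 partitions, 20 columns per table) are made-up numbers for illustration, and the "after" figure assumes no schema evolution, i.e. exactly one Column Descriptor per table.

```python
def columns_rows_before(num_tables, num_partitions, avg_cols):
    """Rows in COLUMNS when every table and every partition stores its own column list."""
    return (num_partitions + num_tables) * avg_cols


def columns_rows_after(num_tables, avg_cols):
    """Rows when only tables own a column list (one Column Descriptor per table)."""
    return num_tables * avg_cols


# Hypothetical deployment sizes, not taken from the issue.
before = columns_rows_before(num_tables=1_000, num_partitions=2_000_000, avg_cols=20)
after = columns_rows_after(num_tables=1_000, avg_cols=20)
print(before, after)
```

The partition count dominates the "before" figure, which is why removing it from the formula matters most for heavily partitioned warehouses.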

      Please see the latest review board for additional implementation details.

      1. HIVE-2246.8.patch
        29 kB
        Sohan Jain
      2. HIVE-2246.4.patch
        29 kB
        Sohan Jain
      3. HIVE-2246.3.patch
        23 kB
        Sohan Jain
      4. HIVE-2246.2.patch
        15 kB
        Sohan Jain

        Issue Links

          Activity

          Sohan Jain created issue -
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/985/
          -----------------------------------------------------------

          Review request for hive.

          Summary
          -------

          We can re-organize the JDO models to reduce space usage to keep the metastore scalable for the future. Currently, partitions are the fastest growing objects in the metastore, and the metastore keeps a separate copy of the columns list for each partition. We can normalize the metastore db by decoupling Columns from Storage Descriptors and not storing duplicate lists of the columns for each partition.

          An idea is to create an additional level of indirection with a "Column Descriptor" that has a list of columns. A table has a reference to its latest Column Descriptor (note: a table may have more than one Column Descriptor in the case of schema evolution). Partitions and Indexes can reference the same Column Descriptors as their parent table.

          Currently, the COLUMNS table in the metastore has roughly (number of partitions + number of tables) * (average number of columns per table) rows. We can reduce this to (number of tables) * (average number of columns per table) rows, while incurring a small cost proportional to the number of tables to store the Column Descriptors.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs


          trunk/metastore/if/hive_metastore.thrift 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MDatabase.java 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MFieldSchema.java 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MIndex.java 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartition.java 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1140399
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTable.java 1140399
          trunk/metastore/src/model/package.jdo 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/index/TableBasedIndexHandler.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 1140399
          trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 1140399

          Diff: https://reviews.apache.org/r/985/diff

          Testing
          -------

          Haven't run any unit tests yet, just qualitative testing so far.

          Thanks,

          Sohan

          Sohan Jain added a comment -

          See the latest review board. This patch is incompatible with the current metastore and requires schema migration.

          Sohan Jain made changes -
          Field Original Value New Value
          Attachment HIVE-2246.2.patch [ 12487393 ]
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          Review request for hive, Ning Zhang and Paul Yang.

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.
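The three tables described above can be sketched with an in-memory SQLite database. This is only an illustration of the relationships: CDS, COLUMNS_V2, SDS and CD_ID come from the summary, while SD_ID, COLUMN_NAME, TYPE_NAME, INTEGER_IDX and the sample rows are assumptions made for the example, and the real upgrade script targets MySQL rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE CDS (
    CD_ID INTEGER PRIMARY KEY            -- a Column Descriptor is just an id for now
);
CREATE TABLE COLUMNS_V2 (
    CD_ID       INTEGER NOT NULL REFERENCES CDS (CD_ID),
    COLUMN_NAME TEXT    NOT NULL,        -- assumed column layout
    TYPE_NAME   TEXT,
    INTEGER_IDX INTEGER NOT NULL,        -- position of the column in the list
    PRIMARY KEY (CD_ID, COLUMN_NAME)
);
CREATE TABLE SDS (
    SD_ID INTEGER PRIMARY KEY,           -- assumed key name
    CD_ID INTEGER REFERENCES CDS (CD_ID) -- the columns this storage descriptor uses
);
""")

# One column descriptor with two columns, shared by a table SD and two partition SDs.
conn.execute("INSERT INTO CDS (CD_ID) VALUES (1)")
conn.executemany(
    "INSERT INTO COLUMNS_V2 (CD_ID, COLUMN_NAME, TYPE_NAME, INTEGER_IDX) VALUES (1, ?, ?, ?)",
    [("id", "bigint", 0), ("name", "string", 1)],
)
conn.executemany("INSERT INTO SDS (SD_ID, CD_ID) VALUES (?, 1)", [(10,), (11,), (12,)])

column_rows = conn.execute("SELECT COUNT(*) FROM COLUMNS_V2").fetchone()[0]
sd_rows = conn.execute("SELECT COUNT(*) FROM SDS WHERE CD_ID = 1").fetchone()[0]
print(column_rows, sd_rows)
```

Three storage descriptors share two column rows here; under the old layout each storage descriptor would have carried its own copy.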

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.
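The migration rule above can be sketched in a few lines. This is an in-memory model, not the SQL upgrade script: the `migrate` function, the dictionary layout and the sample table are all hypothetical.

```python
from itertools import count


def migrate(tables):
    """Wrap each table's columns in one column descriptor and point the
    storage descriptors of its partitions at that same descriptor.

    tables: {table_name: {"columns": [(name, type), ...], "partitions": [name, ...]}}
    Returns (cds, sd_to_cd), where cds maps cd_id -> column list and
    sd_to_cd maps a table or (table, partition) key -> cd_id.
    """
    next_cd_id = count(1)
    cds = {}
    sd_to_cd = {}
    for table_name, table in tables.items():
        cd_id = next(next_cd_id)
        cds[cd_id] = list(table["columns"])
        sd_to_cd[table_name] = cd_id
        for partition in table["partitions"]:
            # Before migration a partition and its table share the same columns,
            # so the partition can simply reuse the table's descriptor.
            sd_to_cd[(table_name, partition)] = cd_id
    return cds, sd_to_cd


tables = {
    "page_views": {
        "columns": [("user_id", "bigint"), ("url", "string")],
        "partitions": ["ds=2011-07-01", "ds=2011-07-02"],
    }
}
cds, sd_to_cd = migrate(tables)
print(cds, sd_to_cd)
```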

          When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When adding or altering a table, create a new column descriptor every time.
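The two rules above (reuse for partitions when the columns match, always create for tables) can be sketched as follows. The function names and the in-memory `cds` dictionary are hypothetical stand-ins; the actual logic lives in ObjectStore.java and is not reproduced here.

```python
from itertools import count

_next_cd_id = count(1)
cds = {}  # cd_id -> list of (column_name, type_name)


def new_cd(columns):
    """Create a fresh column descriptor holding a copy of the given columns."""
    cd_id = next(_next_cd_id)
    cds[cd_id] = list(columns)
    return cd_id


def cd_for_table(columns):
    """Adding or altering a table: create a new column descriptor every time."""
    return new_cd(columns)


def cd_for_partition(table_cd_id, partition_columns):
    """Adding or altering a partition: reuse the parent table's descriptor
    only if it describes exactly the same columns."""
    if cds[table_cd_id] == list(partition_columns):
        return table_cd_id
    return new_cd(partition_columns)


table_cd = cd_for_table([("id", "bigint")])
same_cd = cd_for_partition(table_cd, [("id", "bigint")])
evolved_cd = cd_for_partition(table_cd, [("id", "bigint"), ("name", "string")])
print(table_cd, same_cd, evolved_cd)
```

Only the partition whose columns differ from the table's gets a descriptor of its own.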

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check whether any other storage descriptor in SDS still points to the related column descriptor. If none does, delete that column descriptor. This check keeps unreferenced column descriptors and columns from lingering after table schema evolution.
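The reference check on drop can be sketched like this. It is a simplified in-memory model with hypothetical names (`drop_storage_descriptor`, `sd_to_cd`, `cds`), not the ObjectStore implementation.

```python
def drop_storage_descriptor(sd_id, sd_to_cd, cds):
    """Remove a storage descriptor and delete its column descriptor if no
    other storage descriptor still points to it.

    Returns True when the column descriptor was deleted as well.
    """
    cd_id = sd_to_cd.pop(sd_id)
    still_referenced = any(other == cd_id for other in sd_to_cd.values())
    if still_referenced:
        return False
    del cds[cd_id]
    return True


cds = {1: [("id", "bigint")]}
sd_to_cd = {"table_sd": 1, "partition_sd": 1}

first = drop_storage_descriptor("partition_sd", sd_to_cd, cds)   # table still uses CD 1
second = drop_storage_descriptor("table_sd", sd_to_cd, cds)      # last reference is gone
print(first, second, cds)
```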

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs


          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
          trunk/metastore/src/model/package.jdo 1148945

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          Sohan Jain made changes -
          Tags metastore, schema, JDO
          Description
          Original Value: the three Description paragraphs shown at the top of this issue, without the leading note or the closing pointer to the review board.
          New Value: the same three paragraphs, with "Note: this patch proposes a schema change, and is therefore incompatible with the current metastore." prepended and "Please see the latest review board for additional implementation details." appended.
          Sohan Jain added a comment -

          Adding some missing files that I forgot to svn add

          Sohan Jain made changes -
          Attachment HIVE-2246.3.patch [ 12487394 ]
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-07-22 05:30:29.026246)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          Adding some files I missed in the last diff.

          Diffs (updated)


          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945
          trunk/metastore/src/model/package.jdo 1148945

          Diff: https://reviews.apache.org/r/1183/diff

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/#review1176
          -----------------------------------------------------------

          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          <https://reviews.apache.org/r/1183/#comment2467>

          Is the CHARSET (latin1) the same as for SDS? This will require the user's comments to be in latin1, which prevents UTF chars.

          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          <https://reviews.apache.org/r/1183/#comment2466>

          Can you also add a migration script for Derby? We support Derby as a default metastore RDBMS as well.

          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          <https://reviews.apache.org/r/1183/#comment2468>

          Here, do you check whether the 'alter table' command changes the schema (column definitions)? If it just sets a table property, then you don't need to create a new ColumnDescriptor, right?

          Also, if a table's schema is changed, a new CD will be created, but the old partitions will still have the old CDs. When we query an old partition, do we use the old partition's CD or the table's CD?

          Also, in the above case, when you run 'desc table partition <old_partition>', do you return the old partition's CD or the table's CD?

          - Ning

          jiraposter@reviews.apache.org added a comment -

          On 2011-07-25 06:46:04, Ning Zhang wrote:

          > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 76

          > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line76>

          >

          > Is the CHARSET (latin1) the same as SDS? This will require the user's comments to be in latin1, which prevents UTF chars.

          Yes, this charset matches the one used in the official Hive schema for 0.7.0.

          On 2011-07-25 06:46:04, Ning Zhang wrote:

          > trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql, line 206

          > <https://reviews.apache.org/r/1183/diff/2/?file=26824#file26824line206>

          >

          > Can you also add a migration script for Derby? We support Derby as a default metastore RDBMS as well.

          Ok, will do. I will add it in the next-next diff here.

          On 2011-07-25 06:46:04, Ning Zhang wrote:

          > trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, line 1752

          > <https://reviews.apache.org/r/1183/diff/2/?file=26825#file26825line1752>

          >

          > Here, do you check whether the 'alter table' command changes the schema (column definitions)? If it just sets a table property, then you don't need to create a new ColumnDescriptor, right?

          >

          > Also, if a table's schema is changed, a new CD will be created, but the old partition will still have the old CD. When we query the old partition, do we use the old partition's CD or the table's CD?

          >

          > Also in the above case, when you run 'desc table partition <old_partition>', do you return the old partition's CD or the table's CD?

          Good point; I should check whether the table columns have changed; I do this already when altering partitions. I added that in the next diff.

          If a table's schema changes, it does not update existing partition CDs. If we ever grab the partition object after the schema change, it will refer to its old CD, not the table's CD. However, when querying tables on the CLI, we almost always use the table's set of columns. E.g., if you did:

          create table test (a string) partitioned by (p1 string, p2 string);

          alter table test add partition(p1=1, p2=1);

          -- populate the p1=1, p2=1 partition with some data now

          alter table test add columns (b string);

          select * from test where p1 = 1 and p2 = 1;

          the select would use the table's latest schema; i.e., it would return column 'a' values as stored and column 'b' as all NULL.
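
To make the outcome of this example concrete, here is a minimal Java sketch of reading a row written under the old schema against the table's latest column list. `SchemaProjection` and `project` are hypothetical names, and the name-based lookup is an assumption made purely for illustration; Hive's real read path goes through its SerDe layer and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive: projects a stored row onto the
// table's latest list of column names.
final class SchemaProjection {
    private SchemaProjection() {}

    // storedRow maps column name -> value as written under the old schema.
    // Columns missing from the stored row come back as null (NULL in a query).
    static List<String> project(Map<String, String> storedRow, List<String> tableColumns) {
        List<String> result = new ArrayList<>();
        for (String column : tableColumns) {
            result.add(storedRow.get(column)); // null when the column was added later
        }
        return result;
    }
}
```

With a row that only has a value for `a` and a table schema of `a, b`, the projected row keeps the value of `a` and carries null for `b`, which matches the behaviour described for the select.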

          • Sohan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/#review1176
          -----------------------------------------------------------

          On 2011-07-22 05:30:29, Sohan Jain wrote:

          -----------------------------------------------------------

          This is an automatically generated e-mail. To reply, visit:

          https://reviews.apache.org/r/1183/

          -----------------------------------------------------------

          (Updated 2011-07-22 05:30:29)

          Review request for hive, Ning Zhang and Paul Yang.

          Summary

          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          - CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.

          - COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.

          - SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor if the columns they describe are the same. Otherwise, create a new column descriptor for its columns.
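
The partition rule above could be sketched roughly as follows. This is not the code in ObjectStore.java: `ColumnDescriptorSketch` and `forPartition` are hypothetical names, and columns are simplified to "name:type" strings standing in for MFieldSchema objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a column descriptor: just an immutable column list.
final class ColumnDescriptorSketch {
    final List<String> columns; // "name:type" entries, a simplification of MFieldSchema

    ColumnDescriptorSketch(List<String> columns) {
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }

    // Reuse the parent table's descriptor when the partition describes the
    // same columns in the same order; otherwise create a descriptor of its own.
    static ColumnDescriptorSketch forPartition(ColumnDescriptorSketch tableCd,
                                               List<String> partitionColumns) {
        if (tableCd != null && tableCd.columns.equals(partitionColumns)) {
            return tableCd; // shared: no duplicate column rows are stored
        }
        return new ColumnDescriptorSketch(partitionColumns);
    }
}
```

Sharing by reference is what removes the per-partition copies of the column list; only partitions whose columns really differ pay for a descriptor of their own.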

          When adding or altering a table, create a new column descriptor every time.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table. That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.
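
A rough in-memory sketch of this drop-time check is shown below. `DescriptorStoreSketch` and its methods are hypothetical; the map plays the role of the SDS-to-CD_ID foreign key, and the real patch performs the equivalent check against the metastore database.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of storage descriptors pointing at column descriptors.
final class DescriptorStoreSketch {
    private final Map<Long, Long> cdIdBySdId = new HashMap<>(); // stand-in for SDS.CD_ID
    private final Set<Long> columnDescriptors = new HashSet<>(); // stand-in for CDS rows

    void addStorageDescriptor(long sdId, long cdId) {
        columnDescriptors.add(cdId);
        cdIdBySdId.put(sdId, cdId);
    }

    // Drop the storage descriptor, then delete its column descriptor only if
    // no other storage descriptor still points to it.
    void dropStorageDescriptor(long sdId) {
        Long cdId = cdIdBySdId.remove(sdId);
        if (cdId == null) {
            return; // unknown storage descriptor: nothing to clean up
        }
        if (!cdIdBySdId.containsValue(cdId)) {
            columnDescriptors.remove(cdId);
        }
    }

    boolean hasColumnDescriptor(long cdId) {
        return columnDescriptors.contains(cdId);
    }
}
```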

          This addresses bug HIVE-2246.

          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs

          -----

          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION

          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1148945

          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION

          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1148945

          trunk/metastore/src/model/package.jdo 1148945

          Diff: https://reviews.apache.org/r/1183/diff

          Testing

          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          On 2011-07-25 06:46:04, Ning Zhang wrote:

          > trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java, line 1752

          > <https://reviews.apache.org/r/1183/diff/2/?file=26825#file26825line1752>

          >

          > Here, do you check whether the 'alter table' command changes the schema (column definitions)? If it just sets a table property, then you don't need to create a new ColumnDescriptor, right?

          >

          > Also, if a table's schema is changed, a new CD will be created, but the old partition will still have the old CD. When we query the old partition, do we use the old partition's CD or the table's CD?

          >

          > Also in the above case, when you run 'desc table partition <old_partition>', do you return the old partition's CD or the table's CD?

          Sohan Jain wrote:

          Good point; I should check whether the table columns have changed; I do this already when altering partitions. I added that in the next diff.

          If a table's schema changes, it does not update existing partition CDs. If we ever grab the partition object after the schema change, it will refer to its old CD, not the table's CD. However, when querying tables on the CLI, we almost always use the table's set of columns. E.g., if you did:

          create table test (a string) partitioned by (p1 string, p2 string);

          alter table test add partition(p1=1, p2=1);

          -- populate the p1=1, p2=1 partition with some data now

          alter table test add columns (b string);

          select * from test where p1 = 1 and p2 = 1;

          the select would use the table's latest schema; i.e., it would return column 'a' values as stored and column 'b' as all NULL.

          Also, I fixed the "desc table partition" to use the partition's column schema, not the table's.

          • Sohan

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/#review1176
          -----------------------------------------------------------


          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-08-05 20:48:05.144312)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          - On alter table, only change the column descriptor if the columns have changed.
          - Fix "desc table partition..." to use the partition's column schema, not the table's.

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor if the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When adding or altering a table, create a new column descriptor every time.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table. That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs (updated)


          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-08-05 20:49:19.127572)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          - Forgot to add a few files. NOTE: this is only a temporary diff; I still need to add Derby and Postgres migration scripts.

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor if the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When adding or altering a table, create a new column descriptor every time.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table. That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs (updated)


          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          Sohan Jain made changes -
          Attachment HIVE-2246.4.patch [ 12489531 ]
          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/#review1309
          -----------------------------------------------------------

          Also, can you add migration scripts for other DBs?

          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          <https://reviews.apache.org/r/1183/#comment2982>

          Typo

          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          <https://reviews.apache.org/r/1183/#comment2979>

          The check and the delete should be in the same transaction, as it's possible for a reference to a CD to be created after the check but before the delete.
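The race described in this review comment can be illustrated with a minimal sketch. It uses Python's `sqlite3` as a stand-in for the metastore database; the real patch goes through JDO in `ObjectStore.java`, so everything here is an assumption made for illustration: the function name `drop_storage_descriptor`, the helper `make_db`, the sample rows, and every column other than `CD_ID`. The point is only that the reference count and the delete sit inside one transaction, so no new reference to the column descriptor can appear between them.

```python
import sqlite3


def make_db():
    """In-memory stand-in for the metastore: one CD shared by two SDs."""
    # isolation_level=None puts sqlite3 in autocommit mode so that the
    # transaction boundaries below are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE CDS (CD_ID INTEGER PRIMARY KEY);
        CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT);
        CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER);
        INSERT INTO CDS VALUES (1);
        INSERT INTO COLUMNS_V2 VALUES (1, 'id', 'bigint'), (1, 'name', 'string');
        INSERT INTO SDS VALUES (10, 1), (11, 1);
        """
    )
    return conn


def drop_storage_descriptor(conn, sd_id):
    """Delete one SD and, if it was the last reference, its CD and columns.

    The reference check and the deletes share a single transaction.
    Returns True if the SD existed.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        row = conn.execute(
            "SELECT CD_ID FROM SDS WHERE SD_ID = ?", (sd_id,)
        ).fetchone()
        if row is not None:
            cd_id = row[0]
            conn.execute("DELETE FROM SDS WHERE SD_ID = ?", (sd_id,))
            (refs,) = conn.execute(
                "SELECT COUNT(*) FROM SDS WHERE CD_ID = ?", (cd_id,)
            ).fetchone()
            if refs == 0:
                # No other storage descriptor points at this CD.
                conn.execute("DELETE FROM COLUMNS_V2 WHERE CD_ID = ?", (cd_id,))
                conn.execute("DELETE FROM CDS WHERE CD_ID = ?", (cd_id,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row is not None
```

Dropping SD 10 leaves the column descriptor in place because SD 11 still references it; dropping SD 11 afterwards removes the descriptor and its columns.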

          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          <https://reviews.apache.org/r/1183/#comment2981>

          How does this drop the storage descriptor?

          trunk/metastore/src/model/package.jdo
          <https://reviews.apache.org/r/1183/#comment2968>

          Fix indent

          • Paul


          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-08-06 01:40:49.118616)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          -made listStorageDescriptors.. into one transaction
          -renamed dropStorageDescriptorCleanly to make its functionality clearer
          -indents & typo

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.
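The three tables can be pictured as a small object model. The sketch below is not the JDO model from the patch; the class names mirror the description, while the fields (`name`, `type`, `sd_id`), the helper functions, and the sample data are assumptions chosen for illustration. It shows why sharing one column descriptor between a table and its partitions shrinks the number of stored column rows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldSchema:
    """One column, i.e. one row of COLUMNS_V2 (fields are illustrative)."""
    name: str
    type: str


@dataclass
class ColumnDescriptor:
    """One row of CDS plus the COLUMNS_V2 rows that point at it."""
    cd_id: int
    cols: List[FieldSchema] = field(default_factory=list)


@dataclass
class StorageDescriptor:
    """One row of SDS; `cd` plays the role of the CD_ID foreign key."""
    sd_id: int
    cd: ColumnDescriptor


def stored_column_rows(sds):
    """Count the COLUMNS_V2 rows needed: each distinct CD is stored once."""
    distinct = {sd.cd.cd_id: sd.cd for sd in sds}
    return sum(len(cd.cols) for cd in distinct.values())


def build_example():
    """One table (SD 1) and three partitions (SDs 2-4) sharing one CD."""
    cd = ColumnDescriptor(
        1, [FieldSchema("id", "bigint"), FieldSchema("name", "string")]
    )
    return [StorageDescriptor(sd_id, cd) for sd_id in range(1, 5)]
```

With one two-column table and three partitions, `stored_column_rows(build_example())` is 2, whereas a layout that copies the column list into every storage descriptor would hold 8 rows.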

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.
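The partition rule amounts to a comparison followed by either reuse or allocation. A minimal sketch, assuming columns are compared as ordered `(name, type)` pairs and that the caller supplies the id for any new descriptor; the function name `cd_for_partition` and the `ColumnDescriptor` tuple are hypothetical and do not come from the patch:

```python
from typing import NamedTuple, Tuple

Column = Tuple[str, str]  # (name, type)


class ColumnDescriptor(NamedTuple):
    cd_id: int
    cols: Tuple[Column, ...]


def cd_for_partition(table_cd, partition_cols, new_cd_id):
    """Reuse the parent table's CD when the column lists match exactly.

    Otherwise build a fresh descriptor for the partition's own columns.
    The comparison is order-sensitive, which is an assumption of this sketch.
    """
    cols = tuple(partition_cols)
    if cols == table_cd.cols:
        return table_cd
    return ColumnDescriptor(new_cd_id, cols)
```

A partition added with the table's current columns gets back the very same descriptor object, so no new column rows are written for it.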

          When adding or altering a table, create a new column descriptor every time.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table. That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs (updated)


          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/#review1313
          -----------------------------------------------------------

          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          <https://reviews.apache.org/r/1183/#comment2984>

          should read 1-N actually

          • Sohan


          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-08-08 20:55:11.546253)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          added derby upgrade and revert-the-upgrade script

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.
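The migration step can be sketched over plain dictionaries. This is a toy model, not the SQL in the upgrade scripts; the function name `migrate` and the three input mappings are assumptions introduced for illustration. Each table's column list is wrapped in a new column descriptor, and every partition's storage descriptor is pointed at its parent table's descriptor instead of keeping its own copy.

```python
def migrate(old_columns, table_sd, partition_parent):
    """Build the deduplicated layout from the per-SD column lists.

    old_columns:      sd_id -> list of (name, type), the old COLUMNS contents
    table_sd:         table name -> sd_id of the table's storage descriptor
    partition_parent: sd_id of a partition -> name of its parent table
    Returns (cds, sd_to_cd): cd_id -> columns, and sd_id -> cd_id.
    """
    cds = {}
    sd_to_cd = {}
    table_cd = {}
    # One column descriptor per table, wrapping that table's columns.
    for cd_id, (table, sd_id) in enumerate(sorted(table_sd.items()), start=1):
        cds[cd_id] = list(old_columns[sd_id])
        sd_to_cd[sd_id] = cd_id
        table_cd[table] = cd_id
    # Partitions share the parent table's descriptor; their own copies
    # of the column list are simply not carried over.
    for sd_id, table in partition_parent.items():
        sd_to_cd[sd_id] = table_cd[table]
    return cds, sd_to_cd
```

For a table with two partitions, three identical column lists in the old layout collapse into a single descriptor referenced by all three storage descriptors.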

          When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When adding or altering a table, create a new column descriptor every time.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check to see if the related column descriptor has any other references in the table. That is, check to see if any other storage descriptors point to that column descriptor. If none do, then delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs (updated)


          trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          -----------------------------------------------------------
          This is an automatically generated e-mail. To reply, visit:
          https://reviews.apache.org/r/1183/
          -----------------------------------------------------------

          (Updated 2011-08-08 21:19:06.999293)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          revised description for latest changes

          Summary (updated)
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When creating a table, create a new column descriptor every time. When altering a table, only construct a new column descriptor if the columns list has changed.
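
          The reuse rules for partitions and tables might look roughly like this. It is a minimal sketch, assuming columns can be compared as an ordered list of "name:type" strings; `DescriptorChoice` and its methods are hypothetical and are not the ObjectStore API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of when a column descriptor is reused versus created.
final class DescriptorChoice {
    /** A column descriptor: an ID plus an ordered column list ("name:type" strings for brevity). */
    static final class ColumnDescriptor {
        final long id;
        final List<String> columns;
        ColumnDescriptor(long id, List<String> columns) {
            this.id = id;
            this.columns = columns;
        }
    }

    private long nextId = 1L;

    /** Creating a table: always make a new column descriptor. */
    ColumnDescriptor create(List<String> columns) {
        return new ColumnDescriptor(nextId++, new ArrayList<>(columns));
    }

    /** Adding or altering a partition: share the table's descriptor only if the columns match. */
    ColumnDescriptor forPartition(ColumnDescriptor tableCd, List<String> partitionColumns) {
        return tableCd.columns.equals(partitionColumns) ? tableCd : create(partitionColumns);
    }

    /** Altering a table: keep the current descriptor unless the column list changed. */
    ColumnDescriptor forAlteredTable(ColumnDescriptor current, List<String> newColumns) {
        return current.columns.equals(newColumns) ? current : create(newColumns);
    }
}
```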

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check whether any other storage descriptor still points to the related column descriptor. If none do, delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.
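
          The drop-time check amounts to a reference count over storage descriptors. A minimal sketch, assuming the SDS and CDS tables are held as in-memory collections; `DropSketch` and its methods are hypothetical names, not the patch's code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of cleaning up column descriptors when a storage descriptor is dropped.
final class DropSketch {
    private final Map<Long, Long> sdToCd = new HashMap<>(); // SDS: SD_ID -> CD_ID
    private final Set<Long> cds = new HashSet<>();          // CDS: live CD_IDs

    void addStorageDescriptor(long sdId, long cdId) {
        cds.add(cdId);
        sdToCd.put(sdId, cdId);
    }

    /** Drops the storage descriptor; returns true if its column descriptor was deleted too. */
    boolean dropStorageDescriptor(long sdId) {
        Long cdId = sdToCd.remove(sdId);
        if (cdId == null) {
            return false; // unknown storage descriptor, nothing to clean up
        }
        if (sdToCd.containsValue(cdId)) {
            return false; // another storage descriptor still references it
        }
        cds.remove(cdId); // last reference gone: delete the column descriptor
        return true;
    }

    boolean hasColumnDescriptor(long cdId) {
        return cds.contains(cdId);
    }
}
```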

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs


          trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1153927
          trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          jiraposter@reviews.apache.org added a comment -

          (Updated 2011-08-08 21:29:23.722825)

          Review request for hive, Ning Zhang and Paul Yang.

          Changes
          -------

          Revert the changes to "describe table T partition P", so that it always shows the table T's schema. If a table's schema has changed, we do not support querying on the old partition's schema at the moment.

          Summary
          -------

          This patch tries to make minimal changes to the API while keeping migration short and somewhat easy to revert.

          The new schema can be described as follows:

          • CDS is a table corresponding to Column Descriptor objects. Currently, it only stores a CD_ID.
          • COLUMNS_V2 is a table corresponding to MFieldSchema objects, or columns. A Column Descriptor holds a list of columns. COLUMNS_V2 has a foreign key to the CD_ID to which it belongs.
          • SDS was modified to reference a Column Descriptor. So SDS now has a foreign key to a CD_ID which describes its columns.

          During migration, we create Column Descriptors for tables in a straightforward manner: their columns are now just wrapped inside a column descriptor. The SDS of partitions use their parent table's column descriptor, since currently a partition and its table share the same list of columns.

          When altering or adding a partition, give it its parent table's column descriptor IF the columns they describe are the same. Otherwise, create a new column descriptor for its columns.

          When creating a table, create a new column descriptor every time. When altering a table, only construct a new column descriptor if the columns list has changed.

          Whenever you drop a storage descriptor (e.g., when dropping tables or partitions), check whether any other storage descriptor still points to the related column descriptor. If none do, delete that column descriptor. This check is in place so we don't have unreferenced column descriptors and columns hanging around after schema evolution for tables.

          This addresses bug HIVE-2246.
          https://issues.apache.org/jira/browse/HIVE-2246

          Diffs (updated)


          trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql PRE-CREATION
          trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql PRE-CREATION
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 1153927
          trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java 1153927
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
          trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1153927
          trunk/metastore/src/model/package.jdo 1153927

          Diff: https://reviews.apache.org/r/1183/diff

          Testing
          -------

          Passes Facebook's regression testing and all existing test cases. In one instance, before migration, the overhead involved with storage descriptors and columns was ~11 GB. After migration, the overhead was ~1.5 GB.

          Thanks,

          Sohan

          Sohan Jain made changes -
          Attachment HIVE-2246.8.patch [ 12489746 ]
          Paul Yang added a comment -

          +1 - tests passed. Will commit.

          Paul Yang added a comment -

          Committed. Thanks Sohan!

          Paul Yang made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Release Note This makes an incompatible change in the metastore DB table schema from previous versions (<0.8). Older metastores created with previous versions of Hive will need to be upgraded with the supplied scripts.
          Fix Version/s 0.8.0 [ 12316178 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #885 (See https://builds.apache.org/job/Hive-trunk-h0.21/885/)
          HIVE-2246. Dedupe tables' column schemas from partitions in the metastore db (Sohan Jain via pauly)

          pauly : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1155573
          Files :

          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java
          • /hive/trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql
          • /hive/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java
          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          • /hive/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
          • /hive/trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java
          • /hive/trunk/metastore/src/model/package.jdo
          Sohan Jain made changes -
          Link This issue requires HIVE-2366 [ HIVE-2366 ]
          Sohan Jain made changes -
          Link This issue is required by HIVE-2367 [ HIVE-2367 ]
          Sohan Jain made changes -
          Link This issue is required by HIVE-2368 [ HIVE-2368 ]
          Paul Yang added a comment -

          Some issues have been identified with this patch. We will be doing some additional testing, but we might roll back so that we don't leave trunk in an unstable state.

          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1059 (See https://builds.apache.org/job/Hive-trunk-h0.21/1059/)
          HIVE-2366. Metastore upgrade scripts for HIVE-2246 do not migrate indexes nor rename the old COLUMNS table (Sohan Jain via Ning Zhang)

          nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197644
          Files :

          • /hive/trunk/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql
          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-0.8.0-SNAPSHOT-h0.21 #82 (See https://builds.apache.org/job/Hive-0.8.0-SNAPSHOT-h0.21/82/)
          HIVE-2366. Metastore upgrade scripts for HIVE-2246 do not migrate indexes nor rename the old COLUMNS table (Sohan Jain via Ning Zhang)

          nzhang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197646
          Files :

          • /hive/branches/branch-0.8/metastore/scripts/upgrade/derby/008-HIVE-2246.derby.sql
          • /hive/branches/branch-0.8/metastore/scripts/upgrade/derby/008-REVERT-HIVE-2246.derby.sql
          • /hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-0.8.0-SNAPSHOT-h0.21 #87 (See https://builds.apache.org/job/Hive-0.8.0-SNAPSHOT-h0.21/87/)
          HIVE-2556. upgrade script 008-HIVE-2246.mysql.sql contains syntax errors. (Ning Zhang via pauly)

          pauly : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1199595
          Files :

          • /hive/branches/branch-0.8/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1070 (See https://builds.apache.org/job/Hive-trunk-h0.21/1070/)
          HIVE-2556. upgrade script 008-HIVE-2246.mysql.sql contains syntax errors. (Ning Zhang via pauly)

          pauly : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1199593
          Files :

          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1079 (See https://builds.apache.org/job/Hive-trunk-h0.21/1079/)
          HIVE-2568 HIVE-2246 upgrade script needs to drop foreign key in COLUMNS_OLD
          (Ning Zhang via namit)

          namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1201091
          Files :

          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1082 (See https://builds.apache.org/job/Hive-trunk-h0.21/1082/)
          HIVE-2572 HIVE-2246 upgrade script changed the COLUMNS_V2.COMMENT length
          (Ning Zhang via namit)

          namit : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1201470
          Files :

          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Namit Jain added a comment -

          Note that there is a bug in the upgrade script. After running this script, the column information for all the partitions is lost: they all inherit the columns from the table definition. It is not a serious problem, as the partition column information is not really used by Hive. The only command whose results will change is:

          describe table <T> partition <P>;

          Ashutosh Chauhan added a comment -

          Thanks, Namit, for pointing this out. HCatalog looks at the column information of partitions, so it will have an issue. Do you have a fix for it? Or, if you can point out which part of the script has the bug, we can take a look.

          Ashutosh Chauhan added a comment -

          Also, I assume this happens only while upgrading an existing metastore. Partitions added after the upgrade, or on a new install, will not lose any information.

          Carl Steinbach made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #1646 (See https://builds.apache.org/job/Hive-trunk-h0.21/1646/)
          HIVE-3424. Error by upgrading a Hive 0.7.0 database to 0.8.0 (008-HIVE-2246.mysql.sql) (Alexander Alten-Lorenz via cws) (Revision 1380483)

          Result = FAILURE
          cws : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1380483
          Files :

          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql
          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #54 (See https://builds.apache.org/job/Hive-trunk-hadoop2/54/)
          HIVE-3424. Error by upgrading a Hive 0.7.0 database to 0.8.0 (008-HIVE-2246.mysql.sql) (Alexander Alten-Lorenz via cws) (Revision 1380483)

          Result = ABORTED
          cws : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1380483
          Files :

          • /hive/trunk/metastore/scripts/upgrade/mysql/008-HIVE-2246.mysql.sql

            People

            • Assignee:
              Sohan Jain
              Reporter:
              Sohan Jain
            • Votes:
              0
              Watchers:
              7

                Development