Hive / HIVE-2246

Dedupe tables' column schemas from partitions in the metastore db

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Metastore
    • Labels:
      None
    • Release Note:
      This makes an incompatible change in the metastore DB table schema from previous versions (<0.8). Older metastores created with previous versions of Hive will need to be upgraded with the supplied scripts.
    • Tags:
      metastore, schema, JDO

      Description

      Note: this patch proposes a schema change, and is therefore incompatible with the current metastore.

      We can re-organize the JDO models to reduce space usage to keep the metastore scalable for the future. Currently, partitions are the fastest growing objects in the metastore, and the metastore keeps a separate copy of the columns list for each partition. We can normalize the metastore db by decoupling Columns from Storage Descriptors and not storing duplicate lists of the columns for each partition.

      An idea is to create an additional level of indirection with a "Column Descriptor" that has a list of columns. A table has a reference to its latest Column Descriptor (note: a table may have more than one Column Descriptor in the case of schema evolution). Partitions and Indexes can reference the same Column Descriptors as their parent table.
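
      The indirection described above can be sketched roughly as follows. This is only an illustration in Python, not the JDO model from the patch: the class names, the `cd` field, and the `add_partition`, `evolve_schema`, and `stored_column_rows` helpers are all hypothetical, and where the reference actually lives in the metastore schema is left open here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Column:
    name: str
    type: str


# eq=False: descriptors are compared by identity, so two descriptors with
# the same columns still count as two stored lists.
@dataclass(eq=False)
class ColumnDescriptor:
    """Hypothetical stand-in for the proposed "Column Descriptor"."""
    columns: List[Column]


@dataclass
class Partition:
    values: List[str]
    cd: ColumnDescriptor  # shared reference, not a per-partition copy


@dataclass
class Table:
    name: str
    cd: ColumnDescriptor  # the table's latest Column Descriptor
    partitions: List[Partition] = field(default_factory=list)

    def add_partition(self, values: List[str]) -> Partition:
        # A new partition points at the table's current descriptor.
        part = Partition(values, self.cd)
        self.partitions.append(part)
        return part

    def evolve_schema(self, columns: List[Column]) -> None:
        # Schema evolution creates a new descriptor; existing partitions
        # keep referencing the descriptor they were created with.
        self.cd = ColumnDescriptor(list(columns))


def stored_column_rows(table: Table) -> int:
    """Count column entries across the distinct descriptors in use."""
    descriptors = {id(table.cd): table.cd}
    for part in table.partitions:
        descriptors[id(part.cd)] = part.cd
    return sum(len(cd.columns) for cd in descriptors.values())
```

      In this sketch, adding partitions does not increase the number of stored column entries; only a schema change does, because it introduces a second descriptor.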

      Currently, the COLUMNS table in the metastore has roughly (number of partitions + number of tables) * (average number of columns per table) rows. We can reduce this to (number of tables) * (average number of columns per table) rows, while incurring a small cost proportional to the number of tables to store the Column Descriptors.
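
      To make the estimate concrete, here is the same arithmetic as a small Python sketch. The function names and the example figures (1,000 tables, 1,000,000 partitions, 20 columns on average) are made up for illustration, and `descriptors_per_table=1` assumes no schema evolution.

```python
def columns_rows_before(num_tables: int, num_partitions: int, avg_cols: int) -> int:
    # One copy of the column list per table and per partition.
    return (num_partitions + num_tables) * avg_cols


def columns_rows_after(num_tables: int, avg_cols: int) -> int:
    # One copy of the column list per table; partitions share it.
    return num_tables * avg_cols


def descriptor_rows(num_tables: int, descriptors_per_table: int = 1) -> int:
    # Extra cost of the indirection: proportional to the number of tables.
    return num_tables * descriptors_per_table


# Hypothetical metastore: 1,000 tables, 1,000,000 partitions, 20 columns each.
before = columns_rows_before(1_000, 1_000_000, 20)  # 20,020,000 rows
after = columns_rows_after(1_000, 20)               # 20,000 rows
overhead = descriptor_rows(1_000)                   # 1,000 descriptor rows
```

      With these made-up numbers, the partition count drops out of the estimate entirely, which is why the saving grows as partitions accumulate.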

      Please see the latest review board for additional implementation details.

      1. HIVE-2246.2.patch
        15 kB
        Sohan Jain
      2. HIVE-2246.3.patch
        23 kB
        Sohan Jain
      3. HIVE-2246.4.patch
        29 kB
        Sohan Jain
      4. HIVE-2246.8.patch
        29 kB
        Sohan Jain

        Issue Links

          • requires: HIVE-2366
          • is required by: HIVE-2367, HIVE-2368

          Activity

          Sohan Jain created issue -
          Sohan Jain made changes -
          Field Original Value New Value
          Attachment HIVE-2246.2.patch [ 12487393 ]
          Sohan Jain made changes -
          Tags metastore, schema, JDO
          Description Added the schema-incompatibility note and the pointer to the review board; the rest of the text is unchanged.
          Sohan Jain made changes -
          Attachment HIVE-2246.3.patch [ 12487394 ]
          Sohan Jain made changes -
          Attachment HIVE-2246.4.patch [ 12489531 ]
          Sohan Jain made changes -
          Attachment HIVE-2246.8.patch [ 12489746 ]
          Paul Yang made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Release Note This makes an incompatible change in the metastore DB table schema from previous versions (<0.8). Older metastores created with previous versions of Hive will need to be upgraded with the supplied scripts.
          Fix Version/s 0.8.0 [ 12316178 ]
          Resolution Fixed [ 1 ]
          Sohan Jain made changes -
          Link This issue requires HIVE-2366 [ HIVE-2366 ]
          Sohan Jain made changes -
          Link This issue is required by HIVE-2367 [ HIVE-2367 ]
          Sohan Jain made changes -
          Link This issue is required by HIVE-2368 [ HIVE-2368 ]
          Carl Steinbach made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Sohan Jain
              Reporter:
              Sohan Jain
            • Votes:
              0
              Watchers:
              8
