Hive
  1. Hive
  2. HIVE-1940

Query Optimization Using Column Statistics and Histograms

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None
    • Tags:
      MetaStore

      Description

      The current basis for cost-based query optimization in Hive is information gathered on tables and partitions. To make further improvements in query optimization possible, the next step is to develop and implement possibilities to gather information on columns as discussed in issue HIVE-33. After that, an implementation of histograms is a possible option to use and collect run-time statistics. Next to the actual implementation of these features, it is also necessary to develop a consistent storage model for the MetaStore.

      1. HiveMetaStore.pdf
        221 kB
        Anja Gruenheid
      2. Agruenheid_ideas11.pdf
        253 kB
        Carl Steinbach

        Issue Links

          Activity

          Hide
          Anja Gruenheid added a comment -

          As first step, I would like to take a closer look at collecting meta data on the column level. In issue HIVE-33, five different statistics are described (# distinct values, # null values, 3 min values, 3 max values, avg size of column) that have been proposed as column meta data. As reference, I would take the implementation of the table/partition meta data collection.
          As far as I can tell, deriving histograms is a little bit more complex than obtaining column information, which is why I want to start out with that.

          Is there an up-to-date MetaStore DDL script or an E/R model?

          Show
          Anja Gruenheid added a comment - As first step, I would like to take a closer look at collecting meta data on the column level. In issue HIVE-33 , five different statistics are described (# distinct values, # null values, 3 min values, 3 max values, avg size of column) that have been proposed as column meta data. As reference, I would take the implementation of the table/partition meta data collection. As far as I can tell, deriving histograms is a little bit more complex than obtaining column information, which is why I want to start out with that. Is there an up-to-date MetaStore DDL script or an E/R model?
          Hide
          John Sichi added a comment -

          Hi Anja,

          To get a DDL script, you can install Hive and then get your DBMS to generate a script. For example, with MySQL, you can use the mysqldump utility with --no-data option.

          For Derby, see

          http://www.vogella.de/articles/ApacheDerby/article.html#usage_sqldump

          For an E/R diagram, I had good results with the open source tool Power Architect:

          http://www.sqlpower.ca/page/architect

          (Some manual layout required after reverse engineering.) You can see an example here:

          http://wiki.apache.org/hadoop/Hive/IndexDev#Metastore_Model

          If you produce a diagram for the complete metastore, we can get it published in the wiki for others to use.

          Show
          John Sichi added a comment - Hi Anja, To get a DDL script, you can install Hive and then get your DBMS to generate a script. For example, with MySQL, you can use the mysqldump utility with --no-data option. For Derby, see http://www.vogella.de/articles/ApacheDerby/article.html#usage_sqldump For an E/R diagram, I had good results with the open source tool Power Architect: http://www.sqlpower.ca/page/architect (Some manual layout required after reverse engineering.) You can see an example here: http://wiki.apache.org/hadoop/Hive/IndexDev#Metastore_Model If you produce a diagram for the complete metastore, we can get it published in the wiki for others to use.
          Hide
          Anja Gruenheid added a comment -

          I have set up the last stable version, but as far as I understood, some features have been added during the current iteration, which also have had impact on the design of the MetaStore. Is there an up-to-date overview of the MetaStore somewhere or should I retrace the updates that have been made since the last release?

          If I can collect all the data that I need, I'll create the model.

          Show
          Anja Gruenheid added a comment - I have set up the last stable version, but as far as I understood, some features have been added during the current iteration, which also have had impact on the design of the MetaStore. Is there an up-to-date overview of the MetaStore somewhere or should I retrace the updates that have been made since the last release? If I can collect all the data that I need, I'll create the model.
          Hide
          John Sichi added a comment -

          If you just svn update to the tip of trunk and build/install from there, you'll get the latest metastore. Substantial additions since 0.6 include support for indexes, authorization, and various database properties.

          Show
          John Sichi added a comment - If you just svn update to the tip of trunk and build/install from there, you'll get the latest metastore. Substantial additions since 0.6 include support for indexes, authorization, and various database properties.
          Hide
          Anja Gruenheid added a comment -

          I created the metastore as you suggested, but I'm missing a couple of tables like IDXS. I used MySQL as local database and adjusted the parameters accordingly.
          When I create tables, I can see them in the metastore via MySQL, so it definitely is working.

          Show
          Anja Gruenheid added a comment - I created the metastore as you suggested, but I'm missing a couple of tables like IDXS. I used MySQL as local database and adjusted the parameters accordingly. When I create tables, I can see them in the metastore via MySQL, so it definitely is working.
          Hide
          Anja Gruenheid added a comment -

          I found out that the IDXS metastore tables are generated when I create an index for the first time.

          Show
          Anja Gruenheid added a comment - I found out that the IDXS metastore tables are generated when I create an index for the first time.
          Hide
          Anja Gruenheid added a comment -

          Here is the metastore model that I generated with MySQL Workbench: http://home.in.tum.de/~gruenhei/HiveMetaStore.pdf
          Comparing this model to the one displayed in the index wiki, I noticed the two tables PARTITIONS and PARTITION_KEY_VALS are missing in my model. Do you have any idea how I can create them? I tried adding partitions on tables, but that just created entries in table PARTITION_KEYS.

          Show
          Anja Gruenheid added a comment - Here is the metastore model that I generated with MySQL Workbench: http://home.in.tum.de/~gruenhei/HiveMetaStore.pdf Comparing this model to the one displayed in the index wiki, I noticed the two tables PARTITIONS and PARTITION_KEY_VALS are missing in my model. Do you have any idea how I can create them? I tried adding partitions on tables, but that just created entries in table PARTITION_KEYS.
          Hide
          John Sichi added a comment -

          Awesome diagram! Can you add it as an attachment and check the radio button to grant license to ASF so that we can use it in the Hive wiki?

          Try loading some data into your partitions; maybe it deferred that part of the schema creation until then.

          There's a tool which can force generation of the entire schema:

          http://www.datanucleus.org/products/accessplatform/rdbms/schematool.html

          There's an ant target generate-schema which invokes it (in metastore/build.xml), but it's out-of-date because it still references jpox instead of datanucleus (e.g. it should be invoking org.datanucleus.store.rdbms.SchemaTool instead of org.jpox.SchemaTool). If you get it working, submit a patch and we can update it.

          Show
          John Sichi added a comment - Awesome diagram! Can you add it as an attachment and check the radio button to grant license to ASF so that we can use it in the Hive wiki? Try loading some data into your partitions; maybe it deferred that part of the schema creation until then. There's a tool which can force generation of the entire schema: http://www.datanucleus.org/products/accessplatform/rdbms/schematool.html There's an ant target generate-schema which invokes it (in metastore/build.xml), but it's out-of-date because it still references jpox instead of datanucleus (e.g. it should be invoking org.datanucleus.store.rdbms.SchemaTool instead of org.jpox.SchemaTool). If you get it working, submit a patch and we can update it.
          Hide
          Anja Gruenheid added a comment -

          Hive MetaStore Model - 02/05/2011

          Show
          Anja Gruenheid added a comment - Hive MetaStore Model - 02/05/2011
          Hide
          Anja Gruenheid added a comment -

          I tried figuring out datanucleus and the creation of the initial metastore model, but I don't quite understand it:

          When I create the metastore in MySQL, I generate the jars by running ant model-jar in the hive/metastore folder. When I then run hive, metastore tables are generated according the command that I use (eg show tables) in MySQL. I referenced org.datanucleus.store.rdbms.SchemaTool instead of jpox before generating the jar file, but it didn't change anything. Basically, there has to be an overview of all metastore tables that can possibly be invoked. My question is: where?

          Thanks a lot for your help!

          Show
          Anja Gruenheid added a comment - I tried figuring out datanucleus and the creation of the initial metastore model, but I don't quite understand it: When I create the metastore in MySQL, I generate the jars by running ant model-jar in the hive/metastore folder. When I then run hive, metastore tables are generated according the command that I use (eg show tables) in MySQL. I referenced org.datanucleus.store.rdbms.SchemaTool instead of jpox before generating the jar file, but it didn't change anything. Basically, there has to be an overview of all metastore tables that can possibly be invoked. My question is: where? Thanks a lot for your help!
          Hide
          Carl Steinbach added a comment -

          @Anja: I noticed that you published a paper about Hive column statistics. Any chance you would be willing to share this code?

          Show
          Carl Steinbach added a comment - @Anja: I noticed that you published a paper about Hive column statistics. Any chance you would be willing to share this code?
          Hide
          Carl Steinbach added a comment -

          Resolving this as a duplicated of HIVE-1938 and HIVE-1362.

          Show
          Carl Steinbach added a comment - Resolving this as a duplicated of HIVE-1938 and HIVE-1362 .

            People

            • Assignee:
              Unassigned
              Reporter:
              Anja Gruenheid
            • Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development